# required libraries for this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
For this analysis, our group will be working with the North Carolina Public Schools Report Card and Statistical Profiles data sets from 2014 to 2017. These data sets cover four consecutive years of educational attributes in North Carolina, USA. The data is collected by the State of North Carolina at http://ncpublicschools.org and made available through the Belk Endowment Educational Attainment Data Repository for North Carolina Public Schools by Dr. Jake Drew.

Among other reasons, the data was collected to evaluate public-school performance so that funds can be allocated efficiently to various educational initiatives. Analyses of such data are important because well-funded, high-impact educational initiatives contribute to increased graduation rates, increased achievement at the post-secondary level, less crime, and greater economic engagement among young people. The Belk Foundation's website says, "Our goal is to empower today's workforce by creating pathways to and through postsecondary education for underrepresented students". For the sake of this analysis, we assume that better-performing schools have better outcomes in postsecondary education.

For the purpose of this exercise, our focus is on describing and predicting school performance between 2014 and 2017 using various school characteristics, such as type of school (elementary, middle, high, or some combination of the three), social demographics, economic demographics, and location. With North Carolina's rapidly changing demographics, it is important to take schools' unique needs into consideration when allocating funds to strategic investment initiatives. Here, we explore where funding can be best applied based on educational achievement data.
To achieve this goal, we will explore through visual and mathematical modeling which features best predict the School Performance Grade (SPG Score), a continuous measure of a school's overall success based on test scores and growth measures. The analysis takes the form of a regression model, and the accuracy of the model will be measured by the root mean-squared error (RMSE). Predicting SPG within 15 points could help an organization like the Belk Endowment allocate funds to schools efficiently.
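RMSE is the square root of the average squared difference between actual and predicted values, so it reads in the same units as the SPG Score itself. A toy computation (with made-up scores) looks like:

```python
import numpy as np

# hypothetical actual and predicted SPG scores for five schools
actual = np.array([70.0, 55.0, 82.0, 48.0, 66.0])
predicted = np.array([65.0, 60.0, 78.0, 55.0, 70.0])

# root mean-squared error: sqrt of the mean squared residual
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)  # ≈ 5.12
```

An RMSE of 5.12 on these toy numbers would mean predictions miss by about 5 points on average, well inside the 15-point target.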
The data set contains 9,731 records and 259 attributes comprised of factors, numbers, and characters. A data definition sheet can be found in Dr. Drew's github repository, https://github.com/jakemdrew/EducationDataNC. To keep the data manageable, our working data set includes 20 of the 259 variables.
These variables were chosen after initial EDA, in which correlation visuals were generated between SPG and all predictors. Each variable's data type and description are displayed in the following table.
| Attribute | DataType | Description |
|---|---|---|
| SPG Score | float64 | School Performance Grade (number) |
| SPG Grade | object | School Performance Grade (Letter Grade) |
| avg_daily_attend_pct | float64 | Average daily attendance percentage |
| category_cd | object | Category Code of the school level E=Elementary, M=Middle, H=High School, I=Elem/Mid Combo, A=All Schools |
| category_cd_modified | object | New attribute - simplified category_cd to distinguish between school types vs combined school levels |
| crime_per_c_num | float64 | Number of crimes or acts of violence per 100 students at the school level |
| EVAAS Growth Status | object | Moved to Appendix |
| lea_federal_perpupil_num | float64 | Federal expense per pupil at LEA level |
| Majority_Minority | object | New Feature - classifying schools as having majority student body from minority racial groups |
| MinorityFemalePct | float64 | Percentage of female minorities at the school level |
| MinorityMalePct | float64 | Percentage of male minorities at the school |
| MinorityOverallPct | float64 | New Feature - sum of MinorityMalePct and MinorityFemalePct |
| Proficient_TCHR_Standard 1_Pct | float64 | Percentage of teachers rated proficient on Teaching Standard 1 |
| school_type_txt | object | Description of school type |
| Science Score | float64 | Science score at the school level |
| short_susp_per_c_num | float64 | Short term suspensions per 100 students at the school level |
| SPG_Met | object | New Feature |
| student_num | float64 | School size or number of students at the school level |
| tchyrs_0thru3_pct | float64 | Percentage of teachers with 0-3 years of teaching experience |
| Year | int64 | School Year |
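The correlation screen used to shortlist these variables can be sketched as follows. This is a toy illustration, not the notebook's actual screen: `noise_attr` and the simulated relationship are made up for demonstration.

```python
import pandas as pd
import numpy as np

# simulate a tiny stand-in for the 259-column frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({'SPG Score': rng.normal(60, 10, 50)})
# a predictor genuinely related to the target, plus one that is pure noise
toy['avg_daily_attend_pct'] = toy['SPG Score'] * 0.5 + rng.normal(0, 5, 50)
toy['noise_attr'] = rng.normal(0, 1, 50)

# rank predictors by absolute correlation with the target
corrs = toy.corr()['SPG Score'].drop('SPG Score').abs().sort_values(ascending=False)
print(corrs)
```

The related predictor floats to the top of the ranking while the noise column falls to the bottom, which is the basis for keeping 20 of the 259 variables.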
schools = pd.read_csv('PublicSchools2014to2017_YZ.csv')
# schools.head()
schools[['SPG Score','SPG Grade','avg_daily_attend_pct','category_cd','crime_per_c_num','EVAAS Growth Status','lea_federal_perpupil_num','MinorityFemalePct','MinorityMalePct','Proficient_TCHR_Standard 1_Pct','school_type_txt','Science Score','short_susp_per_c_num','student_num','tchyrs_0thru3_pct','Year']].head()
# schools.info()
schools[['SPG Score','SPG Grade','avg_daily_attend_pct','category_cd','crime_per_c_num','EVAAS Growth Status','lea_federal_perpupil_num','MinorityFemalePct','MinorityMalePct','Proficient_TCHR_Standard 1_Pct','school_type_txt','Science Score','short_susp_per_c_num','student_num','tchyrs_0thru3_pct','Year']].info(verbose = True)
As mentioned above, the established data files (from 2014 to 2019) provided by Dr. Drew were processed from raw data*. Further data wrangling is conducted to (1) make the data files more reproducible for use with machine learning modeling and (2) combine the year-based data files so we can look for trends. We go into more detail below.
In process (1) we tackle data wrangling. We refer to Dr. Drew's data cleaning scripts**, which include multiple imputation and cleaning steps. During exploratory data analysis, we manipulated categorical variables to remove outliers. We re-mapped the categorical variable category_cd so the analysis would be reproducible. We also examined school_type_txt, which is overwhelmingly dominated by Regular Schools; accordingly, we created a separate data frame containing only Regular Schools. The final significant data manipulation concerns SPG Score: we filtered out 0's, since it does not make sense to train a model on records that are missing the response variable.
For process (2), we merged all year-based data files between 2014 and 2017. Because of discrepancies in the 2018/2019 data, including incomplete records, we used an inner join; the resulting combined dataset totals 259 attributes and 9,731 rows.
*http://nbviewer.jupyter.org/github/jakemdrew/EducationDataNC/blob/master/2014/School%20Datasets/PublicSchools2014.csv **https://nbviewer.jupyter.org/github/jakemdrew/EducationDataNC/tree/master/2014/Machine%20Learning%20Datasets/Source%20Code/PublicSchools2014_ML.ipynb
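The inner-join stacking described in process (2) can be sketched with toy frames. The `unit_code` and `new_2015_only_attr` columns here are hypothetical stand-ins, not columns from the actual files:

```python
import pandas as pd

# toy per-year frames standing in for the PublicSchools yearly files
df_2014 = pd.DataFrame({'unit_code': [1, 2], 'SPG Score': [70, 55],
                        'Year': [2014, 2014]})
df_2015 = pd.DataFrame({'unit_code': [1, 2], 'SPG Score': [72, 57],
                        'Year': [2015, 2015],
                        'new_2015_only_attr': [0.1, 0.2]})

# join='inner' keeps only the columns present in every year,
# dropping attributes that were added or removed between years
combined = pd.concat([df_2014, df_2015], join='inner', ignore_index=True)
print(combined.columns.tolist())  # the 2015-only column is dropped
```

Applied to the real 2014-2017 files, this is what yields a combined frame with a consistent set of 259 attributes.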
It is important to understand the types of schools that are relevant to this analysis. Let's first take a look at school type and see if there are any anomalies (schools for children with disabilities, for example).
pd.DataFrame(schools.school_type_txt.value_counts())
#Distribution of School Category
sns.factorplot("school_type_txt", data=schools, aspect=2,
kind="count",order = schools['school_type_txt'].value_counts().index)
plt.show()
# Data Frame of school type and counts
pd.DataFrame(schools["school_type_txt"].value_counts())
# A second dataframe to hold just Regular Schools
# .copy() avoids SettingWithCopyWarning when we add columns to schools2 later
schools2 = schools[schools["school_type_txt"] == 'Regular School'].copy()
pd.DataFrame(schools2["school_type_txt"].unique())
According to the table and chart above, the dataset is mostly composed of Regular Schools. We created a new dataframe holding just this subset to run our models on.
Let's now take a look at schools by the age ranges of their students: elementary, middle, high, and various combinations of the three.
# quick sanity check of the raw category codes before remapping
schools2['category_cd'].unique()
schools2['category_cd_modified'] = np.select(
[
schools2['category_cd'] == 'A',
schools2['category_cd'] == 'E',
schools2['category_cd'] == 'H',
schools2['category_cd'] == 'I',
schools2['category_cd'] == 'M'
],
[
'Elem./Mid./High Together',
'Elementary School',
'High School',
'Elem./Mid. Together',
'Middle School'
],
default='Mid./High Together'
)
pd.DataFrame(schools2['category_cd_modified'].unique())
Some of these categories have very little representation in the data. For now, we'll remedy this by lumping the combo groups together.
combo = schools2['category_cd_modified'].str.contains('/', regex=False)
schools2['category_cd_modified'] = np.where(combo, 'Combo', schools2['category_cd_modified'])
pd.DataFrame(schools2.category_cd_modified.value_counts())
We now need to somehow capture the demographic composition of schools. Dr. Drew and his capstone groups have shown that classifying schools as majority-minority when they are composed of >50% non-white students highlights meaningful differences in school performance (likely because demographics can serve as a stand-in for economic measures). Let's take the same approach.
schools2['MinorityOverallPct'] = schools2['MinorityMalePct'] + schools2['MinorityFemalePct']
schools2['Majority_Minority'] = np.where(schools2['MinorityOverallPct'] > .5, 1,0)
pd.DataFrame(schools2['Majority_Minority'].value_counts())
# Lets clean up the data to better visualize SPG
SPG = schools2[schools2['SPG Score']!=0]
As we continue our visualizations below, let's clean up SPG Score by excluding 0 values. We will store this in the dataframe SPG.
Here are some summary statistics for the immediate attributes we'll be using for visuals. As we can see, SPG Score has a minimum value of 0. As stated above, it is unhelpful to train a model on response values that are effectively missing.
Despite cleaning the dataset, we still have values of 0 in several of our continuous variables. We will not impute these, because values of 0 make sense in context.
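The zero-vs-missing distinction above can be checked with a simple per-column zero count. This is a toy sketch; in the notebook the same expression would be run against `schools2`:

```python
import pandas as pd

# toy frame standing in for schools2
toy = pd.DataFrame({
    'SPG Score': [0.0, 62.5, 71.0],        # 0 here is an effectively missing grade
    'crime_per_c_num': [0.0, 0.0, 1.3],    # 0 here is meaningful: no reported crimes
})

# count zeros per column to separate "effectively missing" from true zeros
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Columns where a zero is impossible in context (like SPG Score) get filtered; columns where zero is a legitimate value (like crime counts) are left alone.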
schools2[['SPG Score','avg_daily_attend_pct', 'student_num', 'tchyrs_0thru3_pct', 'short_susp_per_c_num', 'MinorityFemalePct', 'MinorityMalePct', 'Majority_Minority']].describe()
# visualize summary statistics on attributes
schools2.boxplot(column=['avg_daily_attend_pct','tchyrs_0thru3_pct', 'MinorityFemalePct','MinorityMalePct'], figsize=(10,10))
plt.show()
# Here we will run a boxplot to show SPG Score by Majority_Minority
schools2.boxplot(column='SPG Score', by='Majority_Minority', figsize=(8,8))
plt.show()
schools2.groupby('Majority_Minority')[['SPG Score']].median()
For schools where the majority of the student body belongs to minority racial groups, the median SPG Score is around 55, versus around 69 for the remaining schools.
#Distribution of School Category
sns.factorplot("category_cd_modified", data=schools2, aspect=2,
kind="count")
plt.show()
In our data wrangling process, we re-mapped category_cd to create category_cd_modified, which captures school types in a reproducible way. In the chart above, we see that Elementary Schools make up the majority of schools within our dataset. This makes sense, as Elementary Schools feed into Middle Schools, which in turn feed into High Schools.
# collapse the 'A+NG' designation into 'A'; all other grades map to themselves
remap_cat_dict = {
'C': 'C',
'B': 'B',
'D': 'D',
'F': 'F',
'A': 'A',
'A+NG': 'A'}
SPG["SPG Grade"] = SPG["SPG Grade"].map(remap_cat_dict).astype('category')
pd.DataFrame(SPG["SPG Grade"].value_counts())
sns.factorplot("SPG Grade", data=SPG, aspect=2,
kind="count", order=['A','B','C','D','F'])
plt.show()
#Pie Plot of SPG Grade Distribution
pie_SPG = SPG.groupby('SPG Grade').agg('count')
plt.figure(1, figsize=(20,15))
SPG_labels = pie_SPG.school_type_txt.sort_values().index
SPG_counts = pie_SPG.school_type_txt.sort_values()
plt.pie(SPG_counts, labels=SPG_labels, autopct='%1.1f%%')
plt.show()
#Let's see the distribution of scores with different variables
f, axes = plt.subplots(2, 2,figsize=(20,10))
#SPG grade related to overall minority percentage
sns.boxplot( y="MinorityOverallPct", x= "SPG Grade", data=SPG, orient='v' , ax=axes[0,0],order=['A','B','C','D','F'])
#SPG grade related to Short term suspensions per 100 students
sns.boxplot( y="short_susp_per_c_num", x= "SPG Grade", data=SPG, orient='v' , ax=axes[0,1],order=['A','B','C','D','F'])
#SPG grade related to Crime rate
sns.boxplot( y="crime_per_c_num", x= "SPG Grade", data=SPG, orient='v' , ax=axes[1,0],order=['A','B','C','D','F'])
#SPG grade related to teacher experience
sns.boxplot( y="tchyrs_0thru3_pct", x= "SPG Grade", data=SPG, orient='v' , ax=axes[1,1],order=['A','B','C','D','F'])
plt.show()
A general observation:
#Let's see the bar plots of scores with different variables
fig, axarr = plt.subplots(1,2,figsize=(16,10))
sns.barplot(x='SPG Grade', y='short_susp_per_c_num', data=SPG, order=['A','B','C','D','F'], ax=axarr[0])
sns.barplot(x='SPG Grade', y='crime_per_c_num', data=SPG, order=['A','B','C','D','F'], ax=axarr[1])
plt.show()
It seems as though there are similarities between these categories: we see higher average values for 'F' schools in both short-term suspensions and crimes per 100 students.
#Comparison of multiple variables with less or more minorities in school
f, axes = plt.subplots(1, 3,figsize=(18,10))
sns.boxplot(x= "SPG Grade", y="short_susp_per_c_num",
hue="Majority_Minority", palette=["m", "g"],
data=SPG, ax=axes[0], order=['A','B','C','D','F'])
sns.boxplot(x= "SPG Grade", y="crime_per_c_num",
hue="Majority_Minority", palette=["m", "g"],
data=SPG, ax=axes[1], order=['A','B','C','D','F'])
sns.boxplot(x= "SPG Grade", y="tchyrs_0thru3_pct",
hue="Majority_Minority", palette=["m", "g"],
data=SPG, ax=axes[2], order=['A','B','C','D','F'])
plt.show()
A general observation:
# the cross tab operator provides an easy way to get these numbers
spg_ct = pd.crosstab([ SPG['school_type_txt'],SPG['Majority_Minority'] ],
SPG['SPG Grade'].astype('category'))
print(spg_ct)
Observation: the distribution of grades A-F differs noticeably between majority-minority schools and the rest.
spg_ct.plot(kind='bar', stacked=True)
plt.show()
spg_rate = spg_ct.div(spg_ct.sum(1).astype(float), axis=0) # normalize the value
# print(spg_rate)
spg_rate.plot(kind='barh', stacked=True)
plt.show()
# create a new dataframe
schools3 = schools2[['avg_daily_attend_pct', 'student_num', 'tchyrs_0thru3_pct', 'MinorityFemalePct','MinorityMalePct', 'Majority_Minority']]
pd.plotting.scatter_matrix(schools3[['avg_daily_attend_pct', 'student_num', 'tchyrs_0thru3_pct', 'MinorityFemalePct','MinorityMalePct']], c=schools3['Majority_Minority'],figsize=(15,15))
plt.show()
#Let's see the distribution of several variables by school category
f, axes = plt.subplots(2, 2,figsize=(10,10))
#Overall minority percentage by school category
sns.boxplot( y="MinorityOverallPct", x= "category_cd_modified", data=schools2, orient='v' , ax=axes[0,0],order=['Elementary School','Middle School','High School','Combo'])
#Short term suspensions per 100 students by school category
sns.boxplot( y="short_susp_per_c_num", x= "category_cd_modified", data=schools2, orient='v' , ax=axes[0,1],order=['Elementary School','Middle School','High School','Combo'])
#Crimes per 100 students by school category
sns.boxplot( y="crime_per_c_num", x= "category_cd_modified", data=schools2, orient='v' , ax=axes[1,0],order=['Elementary School','Middle School','High School','Combo'])
#Teachers with 0-3 years experience by school category
sns.boxplot( y="tchyrs_0thru3_pct", x= "category_cd_modified", data=schools2, orient='v' , ax=axes[1,1],order=['Elementary School','Middle School','High School','Combo'])
plt.show()
Now let's do a univariate analysis on the impact of attendance rates on school performance grade. The expectation prior to doing the analysis is that poor attendance rates result in poor school performance.
grid = sns.FacetGrid(SPG, col="Year")
grid.map(plt.scatter, 'avg_daily_attend_pct', 'SPG Score')
plt.show()
As expected, there is a positive correlation between the percentage of students regularly attending class and the overall quality of the school.
Some people theorize that smaller class sizes are associated with better student outcomes. Let's test that by looking at SPG score by number of students at a given school.
grid = sns.FacetGrid(SPG, col='Year')
grid.map(plt.scatter, 'student_num', 'SPG Score')
plt.show()
While this isn't a good measure of class size (we have no way of measuring student/teacher ratio, just the raw number of students), it's interesting to observe that larger public schools generally receive higher grades than smaller ones. It could be that school size here is a stand-in for the rurality of the school: in rural areas, schools are smaller, and there are fewer opportunities for students. What happens when we color by the majority-minority variable? Of particular interest are high schools.
highschools = SPG[SPG['category_cd_modified'] == 'High School']
groups = highschools.groupby('Majority_Minority')
for name, group in groups:
plt.plot(group["student_num"], group["SPG Score"], marker="o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Student Body Size \n (Students)")
plt.ylabel("SPG Score")
plt.title("School Size vs. SPG Score: High Schools Only \n 2014 - 2017")
plt.legend(title="Majority Minority")
plt.show()
High schools exhibit a particularly interesting trend. There seems to be some benefit to being either an exceptionally small or an exceptionally large high school, while being medium-sized seems detrimental to overall school quality. It's not immediately clear why this would be.
Now let's turn our attention to teacher hiring. It would be reasonable to hypothesize that having teachers with more experience would result in better educational quality. Let's test it.
for name, group in groups:
plt.plot(group['tchyrs_0thru3_pct'], group['SPG Score'], marker = "o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Percent Teachers With 0-3 Years Experience")
plt.ylabel("SPG Score")
plt.title("Teacher Experience vs. SPG Score: High Schools Only \n 2014 - 2017")
plt.legend(title="Majority Minority")
plt.show()
middleschools = SPG[SPG['category_cd_modified'] == 'Middle School']
groups_m = middleschools.groupby('Majority_Minority')
for name, group in groups_m:
plt.plot(group['tchyrs_0thru3_pct'], group['SPG Score'], marker = "o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Percent Teachers With 0-3 Years Experience")
plt.ylabel("SPG Score")
plt.title("Teacher Experience vs. SPG Score: Middle Schools Only \n 2014-2017")
plt.legend(title="Majority Minority")
plt.show()
elemschools = SPG[SPG['category_cd_modified'] == 'Elementary School']
groups_e = elemschools.groupby('Majority_Minority')
for name, group in groups_e:
plt.plot(group['tchyrs_0thru3_pct'], group['SPG Score'], marker = "o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Percent Teachers With 0-3 Years Experience")
plt.ylabel("SPG Score")
plt.title("Teacher Experience vs. SPG Score: Elem. Schools Only \n 2014-2017")
plt.legend(title="Majority Minority")
plt.show()
Does disciplinary policy affect school quality? Let's take a look at suspensions.
for name, group in groups:
plt.plot(group['short_susp_per_c_num'], group['SPG Score'], marker = "o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Short Suspensions Per 100 Students")
plt.ylabel("SPG Score")
plt.title("Short Term Suspensions vs. SPG Score: High Schools Only \n 2014 - 2017")
plt.legend(title="Majority Minority")
plt.show()
for name, group in groups_m:
plt.plot(group['short_susp_per_c_num'], group['SPG Score'], marker = "o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Short Suspensions Per 100 Students")
plt.ylabel("SPG Score")
plt.title("Short Term Suspensions vs. SPG Score: Middle Schools Only \n 2014 - 2017")
plt.legend(title="Majority Minority")
plt.show()
for name, group in groups_e:
plt.plot(group['short_susp_per_c_num'], group['SPG Score'], marker = "o", linestyle="", label=name, alpha=0.2)
plt.xlabel("Short Suspensions Per 100 Students")
plt.ylabel("SPG Score")
plt.title("Short Term Suspensions vs. SPG Score: Elem. Schools Only \n 2014 - 2017")
plt.legend(title="Majority Minority")
plt.show()
In total, there are 259 attributes in the dataset, giving us a wide variety of features we could potentially explore. We could also bring in outside data, for example the GSS (General Social Survey) or Census data, with attributes such as income percentiles, tax brackets, and other demographics. With these features we could test additional hypotheses about how such attributes contribute to students and their scores.
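A sketch of what bringing in outside data might look like, assuming both sources share a geographic key. The `county` key and `median_income` column here are hypothetical, not columns from the dataset or the Census files:

```python
import pandas as pd

# toy school records and toy county-level "Census" data
schools_toy = pd.DataFrame({'county': ['Wake', 'Durham'],
                            'SPG Score': [70.0, 55.0]})
census_toy = pd.DataFrame({'county': ['Wake', 'Durham'],
                           'median_income': [80000, 65000]})

# left join keeps every school even if a county lacks Census coverage
enriched = schools_toy.merge(census_toy, on='county', how='left')
print(enriched.columns.tolist())
```

The enriched frame would then feed the same EDA and modeling pipeline used above.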
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics
# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')
# passing category_cd_modified
enc_df = pd.DataFrame(enc.fit_transform(schools2[['category_cd_modified']]).toarray())
#Creating working dataframe for regression
X = schools2[['Majority_Minority', 'avg_daily_attend_pct', 'student_num', 'tchyrs_0thru3_pct', 'short_susp_per_c_num']]
X = pd.concat([X.reset_index(drop=True), enc_df], axis=1)
y = schools2['SPG Score']
#Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Train model on training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
On average the model misses the SPG Score by roughly 11 grade points, within the 15-point target set out above.
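To judge whether an 11-point RMSE is useful, a common check is a naive baseline that always predicts the mean of the training set; a worthwhile model should clearly beat that baseline's RMSE. A self-contained sketch with made-up scores:

```python
import numpy as np

# toy train/test scores for illustration only
y_train = np.array([70.0, 55.0, 82.0, 48.0])
y_test = np.array([66.0, 60.0])

# naive baseline: predict the training-set mean for every test school
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
print(baseline_rmse)  # ≈ 3.09 on this toy data
```

Running the same comparison on the actual train/test split above would show how much of the 11-point error the regression actually explains away relative to guessing the mean.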
Because we combined the yearly data files, we can examine trends between 2014 and 2017. First, we'll look at the trend of SPG Score over these years.
f, a = plt.subplots(figsize=(10,10))
sns.lineplot(x="Year", y="SPG Score", data=schools2)
plt.show()
# trendline by category
f, a = plt.subplots(figsize=(10,10))
sns.lineplot(x="Year", y="SPG Score", hue="category_cd_modified", data=schools2)
plt.show()